<!DOCTYPE html>
<html lang="en-us">
  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <!-- Enable responsiveness on mobile devices-->
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
  <title>
    
      Reinforcement learning &middot; AIMA Exercises 
    
  </title>
  <!-- CSS -->
  <link rel="stylesheet" href="/aima-exercises/public/css/poole.css">
  <link rel="stylesheet" href="/aima-exercises/public/css/syntax.css">
  <link rel="stylesheet" href="/aima-exercises/public/css/lanyon.css">
  <link rel="stylesheet" href="/aima-exercises/public/css/style.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
  <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.8.1/css/all.css" integrity="sha384-50oBUHEmvpQ+1lW4y57PTFmhCaXp0ML5d60M1M7uH2+nqUivzIebhndOJK28anvf" crossorigin="anonymous">

      <!-- Bootstrap CSS -->
      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">

  <!-- Icons -->
  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="/aima-exercises/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="/aima-exercises/public/aima_logo.ico">

  <!-- RSS -->
  <link rel="alternate" type="application/rss+xml" title="RSS" href="/atom.xml">
</head>

  <body>
    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>Artificial Intelligence : A Modern Approach</p>
  </div>

  <nav class="sidebar-nav">
    <a class="sidebar-nav-item" href="/aima-exercises/">Home</a>
    <span class="sidebar-nav-item">Part - I Artificial Intelligence</span>
  <a class="sidebar-nav-item" href="/aima-exercises/intro-exercises/">Chapter 1 - Introduction</a>
  <a class="sidebar-nav-item" href="/aima-exercises/agents-exercises/">Chapter 2 - Intelligent Agents</a>
  <span class="sidebar-nav-item">Part - II Problem Solving</span>
  <a class="sidebar-nav-item" href="/aima-exercises/search-exercises/">Chapter 3 - Solving Problems By Searching</a>
  <a class="sidebar-nav-item" href="/aima-exercises/advanced-search-exercises">Chapter 4 - Beyond Classical Search</a>
  <a class="sidebar-nav-item" href="/aima-exercises/game-playing-exercises">Chapter 5 - Adversarial Search</a>
  <a class="sidebar-nav-item" href="/aima-exercises/csp-exercises">Chapter 6 - Constraint Satisfaction Problems</a>
  <span class="sidebar-nav-item">Part - III Knowledge, Reasoning and Planning</span>
  <a class="sidebar-nav-item" href="/aima-exercises/knowledge-logic-exercises">Chapter 7 - Logical Agents</a>
  <a class="sidebar-nav-item" href="/aima-exercises/fol-exercises">Chapter 8 - First Order Logic</a>
  <a class="sidebar-nav-item" href="/aima-exercises/logical-inference-exercises">Chapter 9 - Inference in First Order Logic</a>
  <a class="sidebar-nav-item" href="/aima-exercises/planning-exercises">Chapter 10 - Classical Planning</a>
  <a class="sidebar-nav-item" href="/aima-exercises/advanced-planning-exercises">Chapter 11 - Planning and Acting in Real Life</a>
  <a class="sidebar-nav-item" href="/aima-exercises/kr-exercises">Chapter 12 - Knowledge Representation</a>
  <span class="sidebar-nav-item">Part - IV Uncertain Knowledge and Reasoning</span>
  <a class="sidebar-nav-item" href="/aima-exercises/probability-exercises">Chapter 13 - Quantifying Uncertainty</a>
  <a class="sidebar-nav-item" href="/aima-exercises/bayes-nets-exercises">Chapter 14 - Probabilistic Reasoning</a>
  <a class="sidebar-nav-item" href="/aima-exercises/dbn-exercises">Chapter 15 - Probabilistic Reasoning Over Time</a>
  <a class="sidebar-nav-item" href="/aima-exercises/decision-theory-exercises">Chapter 16 - Making Simple Decisions</a>
  <a class="sidebar-nav-item" href="/aima-exercises/complex-decisions-exercises">Chapter 17 - Making Complex Decisions</a>
  <span class="sidebar-nav-item">Part - V Learning</span>
  <a class="sidebar-nav-item" href="/aima-exercises/concept-learning-exercises">Chapter 18 - Learning From Examples</a>
  <a class="sidebar-nav-item" href="/aima-exercises/ilp-exercises">Chapter 19 - Knowledge In Learning</a>
  <a class="sidebar-nav-item" href="/aima-exercises/bayesian-learning-exercises">Chapter 20 - Learning Probabilistic Models</a>
  <a class="sidebar-nav-item" href="/aima-exercises/reinforcement-learning-exercises">Chapter 21 - Reinforcement Learning</a>
  <span class="sidebar-nav-item">Part - VI Communicating, Perceiving and Acting</span>
  <a class="sidebar-nav-item" href="/aima-exercises/nlp-communicating-exercises">Chapter 22 - Natural Language Processing</a>
  <a class="sidebar-nav-item" href="/aima-exercises/nlp-english-exercises">Chapter 23 - Natural Language For Communication</a>
  <a class="sidebar-nav-item" href="/aima-exercises/perception-exercises">Chapter 24 - Perception</a>
  <a class="sidebar-nav-item" href="/aima-exercises/robotics-exercises">Chapter 25 - Robotics</a>
  <span class="sidebar-nav-item">Part - VII Conclusions</span>
  <a class="sidebar-nav-item" href="/aima-exercises/philosophy-exercises">Chapter 26 - Philosophical Foundations</a>
  <a class="sidebar-nav-item" href="/aima-exercises/#/">Chapter 27 - AI The Present And Future</a>
    <span class="sidebar-nav-item">Currently v1.0.0</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2019. All rights reserved.
    </p>
  </div>
</div>

    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/aima-exercises/" title="Home">Artificial Intelligence</a>
            <small>AIMA Exercises </small>
          </h3>
          <br>
          <center>
            <form class="form-inline active-pink-3 active-pink-4" action="/aima-exercises/search" id="site_search" autocomplete="off" method="GET">
              <i class="fas fa-search" aria-hidden="true"></i>
            <input class="form-control form-control-sm ml-3 w-75" type="text" placeholder="Search within AIMA Exercises" aria-label="Search" name="query">
            <input type="submit" value="Go!" class="search-btn">
            </form>
            <br>
            </center>
            



<ul class="breadcrumbb" id="bbreadcrumb">

  <label for="toggletoc" class="toc-icon">
    <span></span>
    <span></span>
    <span></span>
  </label>

   
   
    <li><a class="breadcrumb-text" href="/aima-exercises/"><i class="fa fa-home"></i></a>  </li>
   


</ul>

      </div>
    </div>
      <div class="container content">
        <article class="post">

  <div class="entry">
    <script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      equationNumbers: {
        autoNumber: "AMS"
      }
    },
    tex2jax: {
      inlineMath: [ ['$','$'] ],
      displayMath: [ ['$$','$$'] ],
      processEscapes: true,
    },
    "HTML-CSS": { 
      preferredFont: "TeX", 
      availableFonts: ["STIX","TeX"], 
      styles: {".MathJax": {}} 
    }
  });
</script>

<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<h1 id="21-reinforcement-learning">21. Reinforcement Learning</h1>

<div class="card">
<div class="card-header p-2">
<a href="ex_1/" class="p-2">Exercise 1 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex1');" href="#"><i id="ch21ex1" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.1');" href="#"><i id="ch21ex1" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Implement a passive learning agent in a simple environment, such as the
$4\times 3$ world. For the case of an initially unknown environment
model, compare the learning performance of the direct utility
estimation, TD, and ADP algorithms. Do the comparison for the optimal
policy and for several random policies. For which do the utility
estimates converge faster? What happens when the size of the environment
is increased? (Try environments with and without obstacles.)
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_2/" class="p-2">Exercise 2 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex2');" href="#"><i id="ch21ex2" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.2');" href="#"><i id="ch21ex2" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Chapter <a class="chapterRef" href="/aima-exercises/complex-decisions-exercises/">complex-decisions-chapter</a> defined a
<b>proper policy</b> for an MDP as one that is
guaranteed to reach a terminal state. Show that it is possible for a
passive ADP agent to learn a transition model for which its policy $\pi$
is improper even if $\pi$ is proper for the true MDP; with such models,
the POLICY-EVALUATION step may fail if $\gamma = 1$. Show that this problem cannot
arise if POLICY-EVALUATION is applied to the learned model only at the end of a trial.
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_3/" class="p-2">Exercise 3 (prioritized-sweeping-exercise) </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex3');" href="#"><i id="ch21ex3" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.3');" href="#"><i id="ch21ex3" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Starting with the passive ADP agent,
modify it to use an approximate ADP algorithm as discussed in the text.
Do this in two steps:<br />

1.  Implement a priority queue for adjustments to the utility estimates.
    Whenever a state is adjusted, all of its predecessors also become
    candidates for adjustment and should be added to the queue. The
    queue is initialized with the state from which the most recent
    transition took place. Allow only a fixed number of adjustments.<br />

2.  Experiment with various heuristics for ordering the priority queue,
    examining their effect on learning rates and computation time.
</p>
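<p class="card-text">
One possible shape for step 1 is sketched below in Python. It is a sketch under stated assumptions, not a reference solution: the names <code>prioritized_sweep</code>, <code>U</code>, <code>R</code>, <code>P</code>, <code>max_updates</code> and <code>eps</code> are hypothetical, the learned model is assumed to be a dictionary mapping each state to its successor probabilities under the fixed policy, and the priority used here (size of the last change times the transition probability) is only one of the heuristics that step 2 asks you to compare.
</p>
<pre>
```python
import heapq


def prioritized_sweep(U, R, P, start, gamma=1.0, max_updates=10, eps=1e-6):
    """Apply at most max_updates Bellman backups, most urgent first.

    U: dict state -> utility estimate (modified in place)
    R: dict state -> reward
    P: dict state -> {successor: probability} under the fixed policy;
       every state is a key, and terminal states map to an empty dict
    start: state from which the most recent transition took place
    """
    # Predecessor lists derived from the learned model.
    preds = {s: [] for s in P}
    for s, succs in P.items():
        for s2 in succs:
            preds[s2].append(s)

    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-float("inf"), start)]
    updates = 0
    while queue and updates != max_updates:
        _, s = heapq.heappop(queue)
        new_u = R[s] + gamma * sum(p * U[s2] for s2, p in P[s].items())
        delta = abs(new_u - U[s])
        U[s] = new_u
        updates += 1
        if delta > eps:
            # A changed state makes all of its predecessors candidates.
            for sp in preds[s]:
                heapq.heappush(queue, (-delta * P[sp][s], sp))
    return updates
```
</pre>
<p class="card-text">
A state may sit in the queue more than once in this sketch; stale entries simply produce a backup with a small change. Swapping the expression passed to <code>heappush</code> is enough to try a different ordering heuristic.
</p>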
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_4/" class="p-2">Exercise 4 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex4');" href="#"><i id="ch21ex4" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.4');" href="#"><i id="ch21ex4" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

The direct utility estimation method in
Section <a class="sectionRef" title="" href="#">passive-rl-section</a> uses distinguished terminal
states to indicate the end of a trial. How could it be modified for
environments with discounted rewards and no terminal states?
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_5/" class="p-2">Exercise 5 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex5');" href="#"><i id="ch21ex5" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.5');" href="#"><i id="ch21ex5" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Write out the parameter update equations for TD learning with
$$\hat{U}(x,y) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3\,\sqrt{(x-x_g)^2 + (y-y_g)^2}\ .$$
</p>
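<p class="card-text">
As a possible starting point, the general TD rule for a parameterized utility function can be written as follows, assuming a learning rate $\alpha$, a discount factor $\gamma$, and a transition from state $s$ to state $s'$ (these symbols are not fixed by the exercise itself):
$$\theta_i \leftarrow \theta_i + \alpha\,\bigl[R(s) + \gamma\,\hat{U}_\theta(s') - \hat{U}_\theta(s)\bigr]\,\frac{\partial \hat{U}_\theta(s)}{\partial \theta_i}\ .$$
Because $\hat{U}$ above is linear in the parameters, each partial derivative is simply the feature that multiplies $\theta_i$:
$$\frac{\partial \hat{U}}{\partial \theta_0} = 1,\qquad \frac{\partial \hat{U}}{\partial \theta_1} = x,\qquad \frac{\partial \hat{U}}{\partial \theta_2} = y,\qquad \frac{\partial \hat{U}}{\partial \theta_3} = \sqrt{(x-x_g)^2 + (y-y_g)^2}\ .$$
</p>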
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_6/" class="p-2">Exercise 6 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex6');" href="#"><i id="ch21ex6" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.6');" href="#"><i id="ch21ex6" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Adapt the vacuum world (Chapter <a class="chapterRef" href="/aima-exercises/agents-exercises/">agents-chapter</a>) for
reinforcement learning by including rewards for squares being clean.
Make the world observable by providing suitable percepts. Now experiment
with different reinforcement learning agents. Is function approximation
necessary for success? What sort of approximator works for this
application?
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_7/" class="p-2">Exercise 7 (approx-LMS-exercise) </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex7');" href="#"><i id="ch21ex7" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.7');" href="#"><i id="ch21ex7" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">Implement an exploring reinforcement learning
agent that uses direct utility estimation. Make two versions: one with a
tabular representation and one using the function approximator in
Equation (<a class="equationRef" title="" href="#">4x3-linear-approx-equation</a>). Compare their
performance in three environments:<br />

1.  The $4\times 3$ world described in the chapter.<br />

2.  A ${10}\times {10}$ world with no obstacles and a +1 reward
    at (10,10).<br />

3.  A ${10}\times {10}$ world with no obstacles and a +1 reward
    at (5,5).
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_8/" class="p-2">Exercise 8 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex8');" href="#"><i id="ch21ex8" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.8');" href="#"><i id="ch21ex8" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Devise suitable features for reinforcement learning in stochastic grid
worlds (generalizations of the $4\times 3$ world) that contain multiple
obstacles and multiple terminal states with rewards of $+1$ or $-1$.
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_9/" class="p-2">Exercise 9 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex9');" href="#"><i id="ch21ex9" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.9');" href="#"><i id="ch21ex9" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Extend the standard game-playing environment
(Chapter <a class="chapterRef" href="/aima-exercises/game-playing-exercises/">game-playing-chapter</a>) to incorporate a reward
signal. Put two reinforcement learning agents into the environment (they
may, of course, share the agent program) and have them play against each
other. Apply the generalized TD update rule
(Equation (<a class="equationRef" title="" href="#">generalized-td-equation</a>)) to update the
evaluation function. You might wish to start with a simple linear
weighted evaluation function and a simple game, such as tic-tac-toe.
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_10/" class="p-2">Exercise 10 (10x10-exercise) </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex10');" href="#"><i id="ch21ex10" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.10');" href="#"><i id="ch21ex10" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Compute the true utility function and the best linear
approximation in $x$ and $y$ (as in
Equation (<a class="equationRef" title="" href="#">4x3-linear-approx-equation</a>)) for the
following environments:<br />

1.  A ${10}\times {10}$ world with a single $+1$ terminal state
    at (10,10).<br />

2.  As in (1), but add a $-1$ terminal state at (10,1).<br />

3.  As in (2), but add obstacles in 10 randomly selected squares.<br />

4.  As in (2), but place a wall stretching from (5,2) to (5,9).<br />

5.  As in (1), but with the terminal state at (5,5).<br />

The actions are deterministic moves in the four directions. In each
case, compare the results using three-dimensional plots. For each
environment, propose additional features (besides $x$ and $y$) that
would improve the approximation and show the results.
</p>
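<p class="card-text">
For the two obstacle-free cases with a single $+1$ terminal state, the comparison can be sketched in plain Python as below. This is a minimal sketch under assumptions the exercise does not state: a reward of $-0.04$ in every nonterminal state, no discounting, and the hypothetical helper names <code>true_utilities</code>, <code>fit_linear</code> and <code>max_error</code>. With deterministic moves the true utility is then $1 - 0.04\,d$, where $d$ is the Manhattan distance to the terminal state, and the best linear approximation is the least-squares fit of $\theta_0 + \theta_1 x + \theta_2 y$.
</p>
<pre>
```python
def true_utilities(n=10, goal=(10, 10), step_reward=-0.04):
    """Utilities for deterministic moves, no obstacles, no discounting.

    Assumes the terminal state is worth +1 and step_reward is negative,
    so the optimal policy follows a shortest path to the goal.
    """
    gx, gy = goal
    U = {}
    for x in range(1, n + 1):
        for y in range(1, n + 1):
            d = abs(x - gx) + abs(y - gy)  # steps on a shortest path
            U[(x, y)] = 1.0 + step_reward * d
    return U


def fit_linear(U):
    """Least-squares fit U(x, y) ~ t0 + t1*x + t2*y via normal equations."""
    rows = [((1.0, float(x), float(y)), u) for (x, y), u in U.items()]
    k = 3
    # Augmented matrix [A^T A | A^T u].
    M = [[sum(f[i] * f[j] for f, _ in rows) for j in range(k)]
         + [sum(f[i] * u for f, u in rows)] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot_row = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]


def max_error(U, theta):
    """Largest absolute gap between the true and fitted utilities."""
    t0, t1, t2 = theta
    return max(abs(u - (t0 + t1 * x + t2 * y)) for (x, y), u in U.items())
```
</pre>
<p class="card-text">
Under these assumptions the corner case is exactly linear in $x$ and $y$, so the fit is essentially perfect, while the centered case is not, because the distance term has a kink at the terminal state. That suggests the distance to the terminal state as an extra feature for the centered case. The cases with obstacles, walls, or two terminal states need a real solver such as value iteration for the true utilities.
</p>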
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_11/" class="p-2">Exercise 11 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex11');" href="#"><i id="ch21ex11" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.11');" href="#"><i id="ch21ex11" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Implement the REINFORCE and PEGASUS algorithms and apply them to the $4\times 3$ world,
using a policy family of your own choosing. Comment on the results.
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_12/" class="p-2">Exercise 12 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex12');" href="#"><i id="ch21ex12" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.12');" href="#"><i id="ch21ex12" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Investigate the application of reinforcement learning ideas to the
modeling of human and animal behavior.
</p>
</div>
</div>
<p><br /></p>
<div class="card">
<div class="card-header p-2">
<a href="ex_13/" class="p-2">Exercise 13 </a>
<button type="button" class="btn btn-dark float-right" title="Bookmark Exercise" onclick="bookmark('ch21ex13');" href="#"><i id="ch21ex13" class="fas fa-bookmark" style="color:white"></i></button>
<button type="button" class="btn btn-dark float-right" style="margin-left:10px; margin-right:10px;" title="Upvote Exercise" onclick="upvote('ex21.13');" href="#"><i id="ch21ex13" class="fas fa-thumbs-up" style="color:white"></i></button>
</div>
<div class="card-body">
<p class="card-text">

Is reinforcement learning an appropriate abstract model for evolution?
What connection exists, if any, between hardwired reward signals and
evolutionary fitness?
</p>
</div>
</div>
<p><br /></p>

  </div>

  
</article>


      </div>
    <label for="sidebar-checkbox" class="sidebar-toggle"></label>
    <script>
      (function(document) {
        var toggle = document.querySelector('.sidebar-toggle');
        var sidebar = document.querySelector('#sidebar');
        var checkbox = document.querySelector('#sidebar-checkbox');
        document.addEventListener('click', function(e) {
          var target = e.target;
          if(!checkbox.checked ||
             sidebar.contains(target) ||
             (target === checkbox || target === toggle)) return;
          checkbox.checked = false;
        }, false);
      })(document);
    </script>
        <script src="/aima-exercises/js/main.js"></script>
        <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
        <script src="/aima-exercises/js/answer.js"></script>
        <script src="/aima-exercises/js/commsol.js"></script>
        <script src="/aima-exercises/js/forms.js"></script>
        <script src="/aima-exercises/js/crossref.js"></script>
        <script src="/aima-exercises/js/bookmark.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
  </body>
</html>
